Exam bubbles

GA Data Science Immersive : Project 1

Gwyneth Butera

Project description:
For our first project, we're going to take a look at SAT and ACT scores around the United States. Suppose that the College Board - the organization that administers the SAT - seeks to improve the participation rate of its exams. Your presentation should be geared toward non-technical executives with the College Board and you will use, at minimum, the provided data to make recommendations about how the College Board might work to increase the participation rates of these exams.


I. Load the data and perform basic operations.

In [1]:
# Let's get all our helper libraries loaded first

import numpy as np
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
sns.set_style("whitegrid", {'axes.grid' : False})

# %config InlineBackend.figure_format = 'retina'
%matplotlib inline
A. Load the data in using pandas.

We load the SAT and ACT data into two dataframes. We also load state-by-state enrollment data scraped from governing.com.

In [2]:
# Read the data into DataFrames
sat = pd.read_csv("../data/sat.csv")
act = pd.read_csv("../data/act.csv")

# Enrollment data was scraped from 
# http://www.governing.com/topics/education/gov-fragmented-school-districts-consolidation.html
enroll = pd.read_csv("../data/enrollment.csv")
B. Print the first few rows of each dataframe.
In [3]:
sat.head(3)
Out[3]:
Unnamed: 0 State Participation Evidence-Based Reading and Writing Math Total
0 0 Alabama 5% 593 572 1165
1 1 Alaska 38% 547 533 1080
2 2 Arizona 30% 563 553 1116
In [4]:
act.head(3)
Out[4]:
Unnamed: 0 State Participation English Math Reading Science Composite
0 0 National 60% 20.3 20.7 21.4 21.0 21.0
1 1 Alabama 100% 18.9 18.4 19.7 19.4 19.2
2 2 Alaska 65% 18.7 19.8 20.4 19.9 19.8
C. Describe in words what each variable (column) is.

For the SAT:

  • Unnamed: 0: An index column
  • State: The state being considered
  • Participation: Percentage of students that took the exam
  • Evidence-Based Reading and Writing: Average score for EBRW section (200-800)
  • Math: Average score for math section (200-800)
  • Total: Average combined EBRW and math score

For the ACT:

  • Unnamed: 0: An index column
  • State: The state being considered
  • Participation: Percentage of students that participated
  • English: Average score (1-36)
  • Math: Average score (1-36)
  • Reading: Average score (1-36)
  • Science: Average score (1-36)
  • Composite: Average composite score, where each student's composite is the average of the four section scores, rounded to the nearest whole number with halves rounded up (1-36)
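As a quick illustration of how a single composite is computed (the section scores below are hypothetical):

```python
import math

# Hypothetical section scores: English, Math, Reading, Science
section_scores = [21, 23, 22, 24]

# ACT composite: average of the four sections, with halves rounded up
composite = math.floor(sum(section_scores) / len(section_scores) + 0.5)
# 90 / 4 = 22.5, which rounds up to a composite of 23
```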
D. Does the data look complete? Are there any obvious issues with the observations?
In [5]:
# act.info
print("There are {} NaNs in the dataframe.".format(act.isnull().sum().sum()))
act.describe()
There are 0 NaNs in the dataframe.
Out[5]:
Unnamed: 0 English Math Reading Science Composite
count 52.000000 52.000000 52.000000 52.000000 52.000000 52.000000
mean 25.500000 20.919231 21.173077 22.001923 21.040385 21.509615
std 15.154757 2.332132 1.963602 2.048672 3.151113 2.002083
min 0.000000 16.300000 18.000000 18.100000 2.300000 17.800000
25% 12.750000 19.000000 19.400000 20.475000 19.900000 19.800000
50% 25.500000 20.550000 20.900000 21.700000 21.150000 21.400000
75% 38.250000 23.300000 23.100000 24.125000 22.525000 23.600000
max 51.000000 25.500000 25.300000 26.000000 24.900000 25.500000
In [6]:
sat.describe()
print("There are {} NaNs in the dataframe.".format(sat.isnull().sum().sum()))
There are 0 NaNs in the dataframe.

The data looks complete as there are no missing (null) entries.

However:

  • the minimum ACT Science score, 2.3, is implausibly low for a state average.
  • the minimum SAT Math score, 52, is impossible. (The lowest possible section score is 200.)

We will investigate these 'abnormal' values a little further along in the notebook.
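Those suspect values can be isolated with boolean filters before fixing them. A minimal sketch on toy stand-in frames (the real check would run on the loaded sat and act dataframes):

```python
import pandas as pd

# Toy stand-ins for the loaded dataframes; values echo the suspicious rows
act = pd.DataFrame({"State": ["Alabama", "Maryland"], "Science": [19.4, 2.3]})
sat = pd.DataFrame({"State": ["Alabama", "Maryland"], "Math": [572, 52]})

# Flag impossible or implausible state averages
bad_sat = sat[sat["Math"] < 200]    # below the SAT section floor of 200
low_act = act[act["Science"] < 10]  # far below any plausible state average
```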

E. Are there any issues in the consistency of the data?
In [7]:
# Check consistency of data

# sat_sections = ["Math", "Evidence-Based Reading and Writing"]
# act_sections = ["Math", "English", "Reading", "Science"]

# SAT sections should add up to SAT total
check_sat = abs(sat.Math + sat["Evidence-Based Reading and Writing"] - sat.Total)
for ix,val in check_sat.iteritems():
    if val > 1:
        print("State {} ({}): math+ebrw-Total = {}".format(ix, sat.loc[ix,'State'], val))

# ACT sections should add up to 4 times the ACT Composite
check_act = abs(act.Math + act.English + act.Reading + act.Science - act.Composite*4)
for ix,val in check_act.iteritems():
    if val > 1:
        print("State {} ({}): math + english + reading + science -4*Composite = {}"
              .format(ix, act.loc[ix,'State'], val))

State 20 (Maryland): math+ebrw-Total = 472
State 21 (Maryland): math + english + reading + science -4*Composite = 21.5

The scores for Maryland are inconsistent. We can correct the erroneous entries to ensure the data is consistent.

In [8]:
# Clean data by hand
sat.loc[20, "Math"] = sat.Total[20] - sat.loc[20, "Evidence-Based Reading and Writing"]
act.loc[21, "Science"] = act.Composite[21]*4 - act.Reading[21] - act.English[21] - act.Math[21]
F. Print the types of each column.
In [9]:
# ACT column types
print("ACT column types:")
act.dtypes
ACT column types:
Out[9]:
Unnamed: 0         int64
State             object
Participation     object
English          float64
Math             float64
Reading          float64
Science          float64
Composite        float64
dtype: object
In [10]:
# SAT column types
print("SAT column types:")
sat.dtypes
SAT column types:
Out[10]:
Unnamed: 0                             int64
State                                 object
Participation                         object
Evidence-Based Reading and Writing     int64
Math                                   int64
Total                                  int64
dtype: object
F. Do any types need to be reassigned? If so, go ahead and do it.

We can do some cleaning:

  • Drop the Unnamed: 0 columns from SAT and ACT.
  • Keep only the State and Enroll columns in the enrollment dataframe.
  • Remove the National row from the ACT dataframe.
  • Set State as the index in both testing dataframes.
  • Change the participation rate to a floating point value in both testing dataframes.
  • Adjust the enrollment to be 1/13 of the total K-12 enrollment of each state.

Ideas for improvement:

  • Change Participation from out of 100 to be out of 1.0 (divide by 100).
In [11]:
# Remove the unnamed columns from both and the National row from act
sat.drop('Unnamed: 0', axis=1, inplace=True)
act.drop('Unnamed: 0', axis=1, inplace=True)

# Only keep columns we need
enroll = enroll[['State','Enroll']]

# Change the row index to be state
sat.set_index("State", inplace=True)
act.set_index("State", inplace=True)
enroll.set_index("State", inplace=True)

# Save the national row just in case
act_national = act.loc["National",:]
act.drop('National', axis=0, inplace=True)


# Change participation of SAT, ACT to be a float
sat['Participation'] = sat['Participation'].map(lambda cell: cell.replace('%',''))
act['Participation'] = act['Participation'].map(lambda cell: cell.replace('%',''))

sat['Participation'] = sat['Participation'].astype(float)
act['Participation'] = act['Participation'].astype(float)

enroll['Enroll'] = enroll['Enroll'].map(lambda cell: cell.replace(',',''))
enroll['Enroll'] = enroll['Enroll'].astype(float)

# Approximate number of seniors (1/13 of total K-12 enrollment)
enroll['Enroll'] = enroll['Enroll']/13
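The percent-sign stripping above could also be done more compactly with pandas string accessors; `pd.to_numeric` would additionally surface any non-numeric stragglers. A sketch on a hypothetical raw column:

```python
import pandas as pd

# Hypothetical participation column in the raw "NN%" format
participation = pd.Series(["5%", "38%", "100%"])

# Strip the percent sign and convert in one chained call
rates = pd.to_numeric(participation.str.rstrip("%"))
```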
G. Merge the dataframes on the state column.
In [12]:
# Merge the data on the state column
df = pd.merge(sat, act, on='State')
df = pd.merge(df,enroll, on='State')
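Note that `pd.merge` accepts a `suffixes` argument, so the overlapping column names could be labeled at merge time rather than renamed afterward. A minimal sketch with toy frames (the column values are hypothetical):

```python
import pandas as pd

left = pd.DataFrame({"State": ["Ohio"], "Math": [517]})
right = pd.DataFrame({"State": ["Ohio"], "Math": [18.8]})

# Overlapping columns get the given suffixes instead of the default _x / _y
merged = pd.merge(left, right, on="State", suffixes=("_SAT", "_ACT"))
```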
H. Change the names of the columns so you can distinguish between the SAT columns and the ACT columns.
In [13]:
# Rename the columns 
df = df.rename(columns={
    'Participation_x': 'Participation_SAT',
    'Evidence-Based Reading and Writing' : "EBRW_SAT",
    'Math_x' : 'Math_SAT',
    'Total' : 'Total_SAT',
    'Math_y' : 'Math_ACT',
    'English' : 'English_ACT',
    'Reading' : 'Reading_ACT',
    'Science' : 'Science_ACT',
    'Composite' : 'Composite_ACT',
    'Participation_y': 'Participation_ACT'
})
I. Approximate the number of students taking each test in each state.
In [14]:
# Approximate number of students taking each test
df['Participation_SAT_count'] = df.Enroll * df.Participation_SAT /100
df['Participation_ACT_count'] = df.Enroll * df.Participation_ACT /100
J. Print the minimum and maximum of each numeric column in the data frame.
In [15]:
for c in df.columns:
    print("The min and max of column {} is {} and {}.".format(c,df[c].min(), df[c].max()))
The min and max of column Participation_SAT is 2.0 and 100.0.
The min and max of column EBRW_SAT is 482 and 644.
The min and max of column Math_SAT is 468 and 651.
The min and max of column Total_SAT is 950 and 1295.
The min and max of column Participation_ACT is 8.0 and 100.0.
The min and max of column English_ACT is 16.3 and 25.5.
The min and max of column Math_ACT is 18.0 and 25.3.
The min and max of column Reading_ACT is 18.1 and 25.9.
The min and max of column Science_ACT is 18.2 and 24.7.
The min and max of column Composite_ACT is 17.8 and 25.4.
The min and max of column Enroll is 3457.076923076923 and 461219.53846153844.
The min and max of column Participation_SAT_count is 156.2969230769231 and 244446.35538461537.
The min and max of column Participation_ACT_count is 1079.4092307692308 and 170653.36153846156.
K. Write a function using only list comprehensions, no loops, to compute standard deviation. Using this function, calculate the standard deviation of each numeric column in both data sets. Add these to a list called sd.

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2}$$

Turn the list sd into a new observation in your dataset.
In [16]:
# Function to calculate standard deviation
def my_std(values):
    mu = np.mean(values)
    return np.sqrt(sum([(v-mu)**2 for v in values])/len(values))

sd = df.apply(lambda x: my_std(x))
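A quick sanity check of the list-comprehension version against NumPy, which also defaults to the population formula (`ddof=0`):

```python
import numpy as np

def my_std(values):
    mu = np.mean(values)
    return np.sqrt(sum([(v - mu) ** 2 for v in values]) / len(values))

# Population standard deviation of this sample is exactly 2.0
vals = [2, 4, 4, 4, 5, 5, 7, 9]
assert abs(my_std(vals) - np.std(vals)) < 1e-12
```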


II. Manipulate the dataframe

A. Sort the dataframe by the values in a numeric column (e.g. observations descending by SAT participation rate)
In [17]:
# Sort DataFrame 
df = df.sort_values(by="Participation_SAT", ascending=False)
df.head()
Out[17]:
Participation_SAT EBRW_SAT Math_SAT Total_SAT Participation_ACT English_ACT Math_ACT Reading_ACT Science_ACT Composite_ACT Enroll Participation_SAT_count Participation_ACT_count
State
Michigan 100.0 509 495 1005 29.0 24.1 23.7 24.5 23.8 24.1 103645.384615 103645.384615 30057.161538
Connecticut 100.0 530 512 1041 31.0 25.5 24.6 25.6 24.6 25.2 39602.307692 39602.307692 12276.715385
Delaware 100.0 503 492 996 18.0 24.1 23.4 24.8 23.6 24.1 9267.307692 9267.307692 1668.115385
District of Columbia 100.0 482 468 950 32.0 24.4 23.5 24.9 23.5 24.2 3457.076923 3457.076923 1106.264615
Maine 95.0 513 499 1012 8.0 24.2 24.0 24.8 23.7 24.3 13492.615385 12817.984615 1079.409231
In [18]:
# Sort DataFrame 
df = df.sort_values(by="Participation_ACT", ascending=False)
# df.head(18)
df.head()
Out[18]:
Participation_SAT EBRW_SAT Math_SAT Total_SAT Participation_ACT English_ACT Math_ACT Reading_ACT Science_ACT Composite_ACT Enroll Participation_SAT_count Participation_ACT_count
State
Mississippi 2.0 634 607 1242 100.0 18.2 18.1 18.8 18.8 18.6 31545.307692 630.906154 31545.307692
Montana 10.0 605 591 1196 100.0 19.0 20.2 21.0 20.5 20.3 11079.538462 1107.953846 11079.538462
North Carolina 49.0 546 535 1081 100.0 17.8 19.3 19.6 19.3 19.1 110758.153846 54271.495385 110758.153846
South Carolina 50.0 543 521 1064 100.0 17.5 18.6 19.1 18.9 18.7 55920.384615 27960.192308 55920.384615
Oklahoma 7.0 530 517 1047 100.0 18.5 18.8 20.1 19.6 19.4 51326.615385 3592.863077 51326.615385
In [19]:
# Sort DataFrame 
columns = ["Enroll", "Participation_SAT", "Participation_ACT", "Participation_SAT_count", "Participation_ACT_count"]
df_enroll = pd.DataFrame(df, columns=columns)
df_enroll = df_enroll.sort_values(by="Enroll", ascending=False)
df_enroll.head(15)
Out[19]:
Enroll Participation_SAT Participation_ACT Participation_SAT_count Participation_ACT_count
State
California 461219.538462 53.0 31.0 244446.355385 142978.056923
Texas 379229.692308 62.0 45.0 235122.409231 170653.361538
Florida 206288.538462 83.0 73.0 171219.486923 150590.633077
Illinois 155797.769231 9.0 93.0 14021.799231 144891.925385
Georgia 130706.538462 61.0 55.0 79730.988462 71888.596154
New York 125042.615385 67.0 31.0 83778.552308 38763.210769
Ohio 123184.076923 12.0 75.0 14782.089231 92388.057692
Pennsylvania 122816.000000 65.0 23.0 79830.400000 28247.680000
North Carolina 110758.153846 49.0 100.0 54271.495385 110758.153846
Michigan 103645.384615 100.0 29.0 103645.384615 30057.161538
New Jersey 102721.692308 70.0 34.0 71905.184615 34925.375385
Virginia 96868.846154 65.0 29.0 62964.750000 28091.965385
Washington 80552.461538 64.0 29.0 51553.575385 23360.213846
Indiana 77513.153846 63.0 35.0 48833.286923 27129.603846
Tennessee 75572.615385 5.0 100.0 3778.630769 75572.615385
In [20]:
print("Approximate participation counts:")
print("ACT: {}".format(int(df.Participation_ACT_count.sum())))
print("SAT: {}".format(int(df.Participation_SAT_count.sum())))
Approximate participation counts:
ACT: 2114471
SAT: 1628130

Observations:

  • Four states have 100% Participation in the SAT.
  • Seventeen states have 100% Participation in the ACT.
  • There are large states where neither has 100% participation.
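The counts behind those observations can be pulled with boolean sums. A sketch on a hypothetical slice of the merged frame:

```python
import pandas as pd

# Hypothetical slice of the merged dataframe
df = pd.DataFrame({
    "Participation_SAT": [100.0, 100.0, 2.0, 53.0],
    "Participation_ACT": [29.0, 31.0, 100.0, 31.0],
})

# A boolean Series sums to its number of True entries
n_sat_full = (df["Participation_SAT"] == 100).sum()
n_act_full = (df["Participation_ACT"] == 100).sum()
```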
B. Use a boolean filter to display only observations with a score above a certain threshold (e.g. only states with a participation rate above 50%)
In [21]:
df[df["Participation_SAT"] > 70]
Out[21]:
Participation_SAT EBRW_SAT Math_SAT Total_SAT Participation_ACT English_ACT Math_ACT Reading_ACT Science_ACT Composite_ACT Enroll Participation_SAT_count Participation_ACT_count
State
Florida 83.0 520 497 1017 73.0 19.0 19.4 21.0 19.4 19.8 206288.538462 171219.486923 150590.633077
Idaho 93.0 513 493 1005 38.0 21.9 21.8 23.0 22.1 22.3 21314.153846 19822.163077 8099.378462
District of Columbia 100.0 482 468 950 32.0 24.4 23.5 24.9 23.5 24.2 3457.076923 3457.076923 1106.264615
Connecticut 100.0 530 512 1041 31.0 25.5 24.6 25.6 24.6 25.2 39602.307692 39602.307692 12276.715385
Massachusetts 76.0 555 551 1107 29.0 25.4 25.3 25.9 24.7 25.4 60351.461538 45867.110769 17501.923846
Michigan 100.0 509 495 1005 29.0 24.1 23.7 24.5 23.8 24.1 103645.384615 103645.384615 30057.161538
Rhode Island 71.0 539 524 1062 21.0 24.0 23.3 24.7 23.4 24.0 9394.615385 6670.176923 1972.869231
Delaware 100.0 503 492 996 18.0 24.1 23.4 24.8 23.6 24.1 9267.307692 9267.307692 1668.115385
Maine 95.0 513 499 1012 8.0 24.2 24.0 24.8 23.7 24.3 13492.615385 12817.984615 1079.409231


III. Visualize the data

A. Using MatPlotLib and PyPlot, plot the distribution of the Rate columns for both SAT and ACT using histograms. (You should have two histograms. You might find [this link](https://matplotlib.org/users/pyplot_tutorial.html#working-with-multiple-figures-and-axes) helpful in organizing one plot above the other.)
In [22]:
sat_color = "#32b165"
act_color = "#a38cf4"
reg_color = '#f67088'
reg2_color = '#96a331'

# Set up the matplotlib figure
f, axes = plt.subplots(ncols=2, figsize=(15, 5))
sns.despine(left=True)
plt.subplots_adjust(hspace = .5)

ax=axes[0]
ax.set_title('SAT participation rates')
sns.distplot(df.Participation_SAT, bins=10, color=sat_color,kde=False, ax=ax)
#, hist_kws={"color": current_palette})
ax.set_xlabel('Participation Rates')
ax.set_ylabel('Number of states');

ax=axes[1]
ax.set_title('ACT participation rates')
sns.distplot(df.Participation_ACT, bins=10, color=act_color, kde=False, ax=ax)
#, hist_kws={"color": current_palette})
ax.set_xlabel('Participation Rates')
ax.set_ylabel('Number of states');
B. Plot the Math(s) distributions from both data sets.
In [23]:
# Set up the matplotlib figure
f, axes = plt.subplots(2,2 , figsize=(15, 10))
sns.despine(left=True)
plt.subplots_adjust(hspace = .5)

ax=axes[0,0]
ax.set_title('Math SAT scores')
sns.distplot(df.Math_SAT, bins=10, color=sat_color,kde=False, ax=ax)
ax.set_xlabel('Math Scores');

ax=axes[0,1]
ax.set_title('Math ACT scores')
sns.distplot(df.Math_ACT, bins=10, color=act_color,kde=False, ax=ax)
ax.set_xlabel('Math Scores');

ax=axes[1,0]
ax.set_title('Science ACT scores')
sns.distplot(df.Science_ACT, bins=10, color=act_color,kde=False, ax=ax)
ax.set_xlabel('Science Scores');

ax=axes[1,1]
ax.set_visible(False)
C. Plot the Verbal distributions from both data sets.
In [24]:
# Set up the matplotlib figure
f, axes = plt.subplots(2,2, figsize=(15, 10))
sns.despine(left=True)
plt.subplots_adjust(hspace = .5)

ax=axes[0,0]
ax.set_title('EBRW SAT scores');
sns.distplot(df.EBRW_SAT, bins=20, color=sat_color, kde=False, ax=ax);
ax.set_xlabel('EBRW SAT Scores');

ax=axes[0,1]
ax.set_title('English ACT scores')
sns.distplot(df.English_ACT, bins=10, color=act_color,kde=False, ax=ax)
ax.set_xlabel('English ACT Scores');

ax=axes[1,0]
ax.set_title('Reading ACT scores');
sns.distplot(df.Reading_ACT, bins=20, color=act_color, kde=False, ax=ax);
ax.set_xlabel('Reading ACT scores');

ax=axes[1,1]
ax.set_visible(False)
In [25]:
# Set up the matplotlib figure
f, axes = plt.subplots(ncols=2, figsize=(15, 5))
sns.despine(left=True)
plt.subplots_adjust(hspace = .5)

ax=axes[0]
ax.set_title('Total SAT scores')
sns.distplot(df.Total_SAT, bins=10, color=sat_color, kde=False, ax=ax);
ax.set_xlabel('Total Scores');

ax=axes[1]
ax.set_title('Composite ACT scores');
sns.distplot(df.Composite_ACT, bins=10, color=act_color, kde=False, ax=ax);
ax.set_xlabel('Composite Scores');
D. When we make assumptions about how data are distributed, what is the most common assumption? Does this assumption hold true for any of our columns? Which?

The most common assumption is that the data is normally distributed. This doesn't seem to hold true for any of our columns.
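That eyeball judgment could be backed by a formal normality test such as SciPy's Shapiro-Wilk. A sketch on synthetic data (a small p-value rejects normality):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Bimodal sample, loosely mimicking participation rates clustered near 0 and 100
sample = np.concatenate([rng.normal(10, 5, 25), rng.normal(95, 5, 25)])

stat, p = stats.shapiro(sample)
reject_normality = p < 0.05  # True here: the bimodal sample is clearly non-normal
```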

E. Plot some scatterplots examining relationships between all variables.
In [26]:
# Set up the matplotlib figure
f, axes = plt.subplots(3,2, figsize=(15, 15))
sns.despine(left=True)
plt.subplots_adjust(hspace = .5)

ax=axes[0,0]
ax.set_title("Participation rates by state");
sns.regplot(x="Participation_SAT", y="Participation_ACT", data=df, color=reg_color, ax=ax)
ax.set_xlabel("SAT")
ax.set_ylabel("ACT")

ax=axes[0,1]
ax.set_title("Participation counts by state");
sns.regplot(x="Participation_SAT_count", y="Participation_ACT_count", data=df, color=reg2_color, ax=ax);
ax.set_xlabel("SAT")
ax.set_ylabel("ACT")

ax=axes[1,0]
ax.set_title("English score comparison for SAT vs ACT");
ax.scatter(df.EBRW_SAT, df.English_ACT, color=reg_color, label="ACT English")
ax.scatter(df.EBRW_SAT, df.Reading_ACT, color=reg2_color, label="ACT Reading")
ax.set_xlabel("SAT EBRW")
ax.set_ylabel("ACT English and Reading")
ax.legend()

ax=axes[1,1]
ax.set_title("Math score comparison for each state");
ax.scatter(df.Math_SAT, df.Math_ACT, color=reg_color, label="ACT Math")
ax.scatter(df.Math_SAT, df.Science_ACT, color=reg2_color, label="ACT Science")
ax.set_xlabel("SAT Math")
ax.set_ylabel("ACT Math and Science");
ax.legend()

ax=axes[2,0]
ax.set_title("Overall score comparison by state");
ax.scatter(df.Total_SAT, df.Composite_ACT, color=reg_color, label="Composite ACT")
ax.set_xlabel("Total SAT")
ax.set_ylabel("Composite ACT")
ax.legend()

ax=axes[2,1]
ax.set_visible(False)
In [27]:
# Set up the matplotlib figure
f, axes = plt.subplots(ncols=2, figsize=(15, 5))
sns.despine(left=True)
plt.subplots_adjust(hspace = .5)

ax=axes[0]
ax.set_title('SAT score comparisons')
ax.scatter(df.Total_SAT, df.EBRW_SAT, color=reg_color, label="EBRW");
ax.scatter(df.Total_SAT, df.Math_SAT, color=reg2_color, label="Math")
ax.set_xlabel('Total SAT');
ax.set_ylabel("Section scores")
ax.legend()

ax=axes[1]
ax.set_title('ACT score comparison');
ax.scatter(df.Composite_ACT, df.English_ACT, color=sat_color, label="English");
ax.scatter(df.Composite_ACT, df.Math_ACT, color=act_color, label="Math")
ax.scatter(df.Composite_ACT, df.Science_ACT, color=reg_color, label="Science")
ax.scatter(df.Composite_ACT, df.Reading_ACT, color=reg2_color, label="Reading")
ax.set_xlabel('Composite ACT');
ax.legend();
In [28]:
# Set up the matplotlib figure
f, axes = plt.subplots(ncols=2, figsize=(15, 5))
sns.despine(left=True)
plt.subplots_adjust(hspace = .5)

ax=axes[0]
ax.set_title('SAT total score to participation rates');
sns.regplot(x="Participation_SAT", y="Total_SAT", data=df, color=sat_color, scatter=True, ax=ax)
ax.set_xlabel('Participation rate')
ax.set_ylabel("Total SAT scores")

ax=axes[1]
ax.set_title("ACT composite to participation rates")
sns.regplot(x="Participation_ACT", y="Composite_ACT", data=df, color=act_color, scatter=True, ax=ax)
ax.set_xlabel("Participation rate")
ax.set_ylabel("Composite ACT scores");
In [29]:
# While fun, this doesn't add much to our observations
# sns.pairplot(df);
F. Are there any interesting relationships to note?
  • The states with higher participation rates have lower average scores.
  • Participation rates of the two tests are inversely correlated.
  • Additionally, the scores on the sections of the SAT are positively correlated with the total score (not surprising).

  • While I added approximate enrollment data, I did not have time to fully explore relationships between enrollment and the other data.

G. Create box plots for each variable.
In [30]:
ACT_scores = ["Math_ACT", "English_ACT", "Reading_ACT", "Science_ACT", "Composite_ACT"]
SAT_scores = ["Math_SAT", "EBRW_SAT"]
participation_rates = ["Participation_SAT", "Participation_ACT"]

# Set up the matplotlib figure
f, axes = plt.subplots(ncols=2, figsize=(15, 5))
sns.despine(left=True)
plt.subplots_adjust(hspace = .5)

ax=axes[0]
ax.set_title("SAT Scores")
df.boxplot(SAT_scores, ax=ax)

ax=axes[1]
ax.set_title("ACT Scores")
df.boxplot(ACT_scores, ax=ax);
In [31]:
# Compare participation rates
plt.title("Participation rates")
df.boxplot(participation_rates);
H. Create a US heat map for participation rates by state.
In [32]:
# Modified from example code from plotly

import plotly
import plotly.plotly as py
import pandas as pd

plotly.tools.set_credentials_file(username='gbkgwyneth', api_key='NS01qFBPZLMjc5Kb6AEa')

scl = [[0.0, 'rgb(147, 208, 191)'],[0.2, 'rgb(114, 191, 196)'],[0.4, 'rgb(82, 174, 201)'],\
             [0.6, 'rgb(64, 149, 181)'],[0.8, 'rgb(59, 116, 138)'],[1.0, 'rgb(55, 83, 94)']]

# From Stack Overflow
# https://stackoverflow.com/questions/48979352/choropleth-map-in-python-using-plotly-without-state-codes
state_codes = {
    'District of Columbia' : 'DC','Mississippi': 'MS', 'Oklahoma': 'OK',
    'Delaware': 'DE', 'Minnesota': 'MN', 'Illinois': 'IL', 'Arkansas': 'AR',
    'New Mexico': 'NM', 'Indiana': 'IN', 'Maryland': 'MD', 'Louisiana': 'LA',
    'Idaho': 'ID', 'Wyoming': 'WY', 'Tennessee': 'TN', 'Arizona': 'AZ',
    'Iowa': 'IA', 'Michigan': 'MI', 'Kansas': 'KS', 'Utah': 'UT',
    'Virginia': 'VA', 'Oregon': 'OR', 'Connecticut': 'CT', 'Montana': 'MT',
    'California': 'CA', 'Massachusetts': 'MA', 'West Virginia': 'WV',
    'South Carolina': 'SC', 'New Hampshire': 'NH', 'Wisconsin': 'WI',
    'Vermont': 'VT', 'Georgia': 'GA', 'North Dakota': 'ND',
    'Pennsylvania': 'PA', 'Florida': 'FL', 'Alaska': 'AK', 'Kentucky': 'KY',
    'Hawaii': 'HI', 'Nebraska': 'NE', 'Missouri': 'MO', 'Ohio': 'OH',
    'Alabama': 'AL', 'Rhode Island': 'RI', 'South Dakota': 'SD',
    'Colorado': 'CO', 'New Jersey': 'NJ', 'Washington': 'WA',
    'North Carolina': 'NC', 'New York': 'NY', 'Texas': 'TX',
    'Nevada': 'NV', 'Maine': 'ME'}

state_df = pd.DataFrame.from_dict(state_codes, orient='index')

state_df.rename(columns={ 0:'Code'}, inplace=True)
df_heat = pd.concat([df,state_df], axis=1)
for col in df_heat.columns:
     df_heat[col] = df_heat[col].astype(str)

def heatmap(df, col, title, scl):
    data = [ dict(
             type='choropleth',
             colorscale = scl,
             autocolorscale = False,
             locations = df['Code'],
             z = df[col].astype(float),
             locationmode = 'USA-states',
             text = df['Code'],
             marker = dict(
                line = dict (
                    color = 'rgb(255,255,255)',
                    width = 2
                ) ),
            colorbar = dict(
                title = "Rate")
            ) ]

    layout = dict(
            title = title,
            geo = dict(
                scope='usa',
                projection=dict( type='albers usa' ),
                showlakes = True,
                lakecolor = 'rgb(255, 255, 255)'),
                 )

    fig = dict( data=data, layout=layout )

    return fig
SAT Participation Rate by State
In [53]:
fig_sat = heatmap(df_heat,"Participation_SAT", 'SAT Participation Rate by State', scl)
py.iplot( fig_sat, filename='d3-cloropleth-map' )
High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~gbkgwyneth/0 or inside your plot.ly account where it is named 'd3-cloropleth-map'
Out[53]:
ACT Participation Rate by State
In [54]:
fig_act = heatmap(df_heat,"Participation_ACT", 'ACT Participation Rate by State', scl)
py.iplot( fig_act, filename='d3-cloropleth-map' )
High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~gbkgwyneth/0 or inside your plot.ly account where it is named 'd3-cloropleth-map'
Out[54]:
SAT Participation Count by State
In [55]:
fig_sat_count = heatmap(df_heat,"Participation_SAT_count", 'SAT Participation Count by State', scl)
py.iplot( fig_sat_count, filename='d3-cloropleth-map' )
High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~gbkgwyneth/0 or inside your plot.ly account where it is named 'd3-cloropleth-map'
Out[55]:
ACT Participation Count by State
In [56]:
fig_act_count = heatmap(df_heat,"Participation_ACT_count", 'ACT Participation Count by State', scl)
py.iplot( fig_act_count, filename='d3-cloropleth-map' )
High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~gbkgwyneth/0 or inside your plot.ly account where it is named 'd3-cloropleth-map'
Out[56]:
Enrollment by State
In [57]:
fig_enroll = heatmap(df_heat,"Enroll", 'Enrollment by State', scl)
py.iplot( fig_enroll, filename='d3-cloropleth-map' )
High five! You successfully sent some data to your account on plotly. View your plot in your browser at https://plot.ly/~gbkgwyneth/0 or inside your plot.ly account where it is named 'd3-cloropleth-map'
Out[57]:

New Hampshire and Vermont did not provide enrollment data, so they are colored black.


IV. Descriptive and Inferential Statistics

A. Summarize each distribution. As data scientists, be sure to back up these summaries with statistics. (Hint: What are the three things we care about when describing distributions?)
In [38]:
# The center, the shape, and the spread.
mu=pd.Series(df.mean(), name="Mean")
std = pd.Series(df.std(), name="Std Dev")
pd.concat([mu,std], axis=1)
Out[38]:
Mean Std Dev
Participation_SAT 38.244898 34.921657
EBRW_SAT 570.020408 46.279265
Math_SAT 557.755102 47.781853
Total_SAT 1127.857143 93.761222
Participation_ACT 66.959184 31.610230
English_ACT 20.791837 2.284261
Math_ACT 21.063265 1.918625
Reading_ACT 21.883673 1.996178
Science_ACT 21.357143 1.697670
Composite_ACT 21.395918 1.953838
Enroll 72622.262166 85052.586064
Participation_SAT_count 33227.145699 55269.378465
Participation_ACT_count 43152.486641 41258.103009

The distributions of the scores and participation rates are neither normal nor close to uniform; the participation rates in particular are bimodal, with many states at or near 100%. I would like to have more data.

B. Summarize each relationship. Be sure to back up these summaries with statistics.

There is an inverse relationship between participation rates of the two tests. Also, average section scores on the tests are related to the average overall scores by state.

C. Execute a hypothesis test comparing the SAT and ACT participation rates. Use $\alpha = 0.05$. Be sure to interpret your results.
In [39]:
t_stat, p_value = stats.ttest_ind(df.Participation_ACT,df.Participation_SAT)
print("The t-statistic is {}.".format(t_stat))
print("The p-value is {}.".format(p_value))
The t-statistic is 4.267209377921401.
The p-value is 4.643980268715058e-05.
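One caveat: `stats.ttest_ind` assumes equal variances by default; passing `equal_var=False` runs Welch's t-test, which is safer when the two groups' spreads differ. A sketch on synthetic data (the means and spreads below are made up to loosely mimic the two participation columns):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(67, 30, 200)  # synthetic "ACT-like" participation rates
b = rng.normal(38, 30, 200)  # synthetic "SAT-like" participation rates

# Welch's t-test: does not pool the two samples' variances
t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
```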
D. Plot the confidence interval
In [40]:
# Plot the confidence interval

# Taken from lecture notebook
# Generate points on the x axis between -4 and 4:
xpoints = np.linspace(-4, 4, 500)

# Use `stats.t.pdf` to get values on the probability density function for the t-distribution.
# The second argument is the degrees of freedom: n1 + n2 - 2.
ypoints = stats.t.pdf(xpoints,
                      len(df.Participation_ACT)+len(df.Participation_SAT) - 2,
                      0, 1)

# Initialize a `matplotlib` "figure."
fig = plt.figure(figsize=(8,5))

# Get the current "axis" out of the figure.
ax = fig.gca()

# Plot the lines using `matplotlib`'s plot function:
ax.plot(xpoints, ypoints, linewidth=3, color=reg2_color)

# Plot a vertical line for our measured difference in rates' t-statistic.
ax.axvline(t_stat, color=reg_color, linestyle='--', lw=5)
ax.axvline(-t_stat, color=reg_color, linestyle='--', lw=5);

Since the p-value is well below α = 0.05, we reject the null hypothesis that the mean participation rates of the two tests are equal; on average, ACT participation is significantly higher than SAT participation. Consistent with this, in states where ACT participation is high, SAT participation tends to be low.

E. Generate and interpret 95% confidence intervals for SAT and ACT participation rates.
In [41]:
n = df.shape[0]
participation_act_mean = df['Participation_ACT'].mean()
participation_act_std_error = df['Participation_ACT'].std()/np.sqrt(n)

interval_act = stats.norm.interval(0.95, participation_act_mean, participation_act_std_error)
print("The 95% confidence interval for the participation of the ACT is [{},{}].".format(interval_act[0],interval_act[1]))
The 95% confidence interval for the participation of the ACT is [58.10848194560994,75.80988540132883].
In [42]:
participation_sat_mean = df['Participation_SAT'].mean()
participation_sat_std_error = df['Participation_SAT'].std()/np.sqrt(n)

interval_sat = stats.norm.interval(0.95, participation_sat_mean, participation_sat_std_error)
print("The 95% confidence interval for the participation rate of the SAT is [{},{}].".format(interval_sat[0],interval_sat[1]))
The 95% confidence interval for the participation rate of the SAT is [28.46701374869913,48.02278216966822].

These intervals should be read with caution: each state counts equally regardless of its population, even though enrollments vary enormously. Without weighting by enrollment, the intervals say little about national participation.
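One way to account for state size is to weight by enrollment, estimating a national participation rate from the approximate counts. A sketch with toy numbers for three hypothetical states:

```python
# Hypothetical enrollment and SAT-taker counts for three states
enroll = [100_000, 50_000, 10_000]
sat_takers = [53_000, 5_000, 10_000]

# Enrollment-weighted national SAT participation rate, in percent
weighted_rate = 100 * sum(sat_takers) / sum(enroll)
# 68_000 / 160_000 = 42.5%
```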

F. Given your answer to 26, was your answer to 27 surprising? Why?

I'm not sure what the confidence intervals for the state participation rates for the SAT and ACT reflect, because the data is not normalized by population sizes.

G. Is it appropriate to generate correlation between SAT and ACT math scores? Why?
In [43]:
# Note: np.correlate returns a raw inner product of the two series,
# not a normalized correlation coefficient
np.correlate(df.Participation_ACT, df.Participation_SAT)
Out[43]:
array([ 81309.])
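For an actual Pearson correlation coefficient, `np.corrcoef` (or `DataFrame.corr`) is the appropriate tool. A minimal sketch with hypothetical arrays:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([8.0, 6.0, 4.0, 2.0])  # perfectly inversely related to x

# np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal entry
r = np.corrcoef(x, y)[0, 1]
```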

Did you say correlations? Let's make a heat map!

In [44]:
# Heat map of correlations between selected columns
cols = ['Enroll','Participation_SAT','Participation_SAT_count','Participation_ACT','Participation_ACT_count','Total_SAT','Composite_ACT',]
colors = sns.color_palette("GnBu_d", 8)
plt.figure(figsize=(10,10))
sns.set_palette(colors)
# sns.heatmap(df_train[top_corr].corr(),vmin=-1,vmax=1, cmap = colors);
sns.heatmap(df[cols].corr(),vmin=-1,vmax=1, cmap = colors, annot=True);

For states with similar participation rates on both tests, generating a correlation might be appropriate. But since states with high participation rates generally have lower average scores, it is probably not appropriate to compare scores between states whose participation rates differ greatly.

H. Suppose we only seek to understand the relationship between SAT and ACT data in 2017. Does it make sense to conduct statistical inference given the data we have? Why?

The samples are not random, so I think that statistical inference is not appropriate.

In [45]:
# Setting up colors for the notebook
sns.palplot(sns.color_palette("husl", 8))
colors = ['#f67088', '#ce8f31', '#96a331', '#32b165', '#35aca4', '#38a7d0', '#a38cf4', '#f461dd']